This analysis is based on the premise that widening political divisions are not just about “what to do” but, more fundamentally, about “what the actual problem is.” For instance, even the conservative National Review highlighted disagreement about which issues are the most important.
Here, using standard NLP (Natural Language Processing) techniques, I explore this question by looking for differences in the language of presidential candidates, using texts from recent Republican and Democratic debates. Key findings are:
1. “Wordcloud” visualization shows systematic differences between candidates, though more surprising are the similarities.
2. Word frequency analysis highlights positional differences between candidates, but does not convey important context information.
3. Bigram tokenization and word-stem searches do the best job of revealing position differences between candidates.
The texts of the presidential debates were downloaded from the UCSB Presidency Project. Transcripts were pasted into Apple Pages and stored as unformatted .txt files. From that point on, all processing was done in R using the capabilities of {tm} and associated libraries.
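The step that separates one candidate's words from the rest of a transcript is not shown here. A minimal base R sketch of how it might look is below. It assumes each turn begins with an upper-case speaker label such as `TRUMP:` and that unlabeled lines continue the previous turn. The names `extract_speaker_text` and `demo_lines` are hypothetical and are not the `candidate_text_tc` function used later in this post.

```r
# Hypothetical helper, for illustration only (not the author's code).
# Assumption: a turn starts with an upper-case label followed by a colon,
# e.g. "TRUMP:", and lines without a label continue the previous turn.
extract_speaker_text <- function(speaker, lines) {
  lines <- lines[nzchar(trimws(lines))]  # drop blank lines
  label_pattern <- "^([[:upper:]][[:upper:]' .-]*):"
  has_label <- grepl(label_pattern, lines)
  labels <- sub(paste0(label_pattern, ".*$"), "\\1", lines[has_label])
  # Carry the most recent label forward; lines before the first label get NA
  current <- c(NA_character_, labels)[cumsum(has_label) + 1]
  spoken <- trimws(sub(label_pattern, "", lines))
  paste(spoken[!is.na(current) & current == speaker], collapse = " ")
}

# Made-up transcript lines; a real file would be read with readLines()
demo_lines <- c(
  "MODERATOR: Welcome.",
  "TRUMP: We will win.",
  "We will win big.",
  "SANDERS: I believe in people.",
  "",
  "TRUMP: Thank you."
)
trump_demo <- extract_speaker_text("TRUMP", demo_lines)
```

The result is a single string per speaker, which is a convenient input for a {tm} corpus or for simple tokenizing.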
The quickest and most visual method to compare texts is word-frequency analysis using wordclouds. Not surprisingly, word choices vary significantly between candidates. However, there are also some striking similarities.
Let’s first just look at the word clouds of different candidates. We can address whether there are differences in the word frequencies used by candidates, as emphasized by the algorithm of the {wordcloud} package. Are there differences within the same party, between candidates, etc.?
Here are the word clouds of Donald Trump’s and Bernie Sanders’s dialogue at the debates. It’s surprising that they share frequent word choices like people, country, and going, as if both are painting a vision of the future that is personal (though radically different in nature).
c_wordcloud(trump_all)
c_wordcloud(sanders_all)
In this case the word clouds couldn’t be more different. Hillary Clinton’s emphasizes think and people, while that of Carly Fiorina, a former businesswoman, primarily emphasizes government.
c_wordcloud(clinton_all)
c_wordcloud(fiorina_all)
Ted Cruz’s wordcloud emphasizes wonkish technicalities, like taxes and washington, while that of Mike Huckabee, a former minister, mixes the populist and a focus on government.
c_wordcloud(cruz_all)
c_wordcloud(huckabee_all)
We can also split the text by debate. Since the debates cover different topics and questions, one might expect to see this reflected in the text of the separate dialogues. What’s surprising here is how comparable the language of each candidate is between the debates. Perhaps the candidates are more interested in staying on message than answering questions directly?
c_wordcloud(candidate_text_tc("TRUMP", r_oct))
c_wordcloud(candidate_text_tc("TRUMP", r_nov))
c_wordcloud(candidate_text_tc("SANDERS", d_oct))
c_wordcloud(candidate_text_tc("SANDERS", d_nov))
A NOTE ON WORDCLOUDS: Both visually appealing and interesting, wordclouds reveal differences between candidate word choices which hint at political preferences and opinions, but do not reveal significant detail about them. For instance, differences in positions on policy and important matters like taxes, terrorism, immigration, and class division are not deducible from the wordclouds.
We can check word frequency directly by simply tokenizing the text and counting single words. Looking for the most frequent words used by each candidate may reveal some clearer differences. To do this analysis some additional words like “thats”, “dont”, “back”, “can”, “get”, “cant”, and “come” are suppressed.
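As a rough illustration of this counting step, here is a minimal base R sketch. It is an assumption about the general approach and not the {tm} pipeline used for the results in this post. The function name `count_words` and the sample sentence are hypothetical, and only the extra suppressed words listed above are removed (a full analysis would normally also drop a standard stop word list).

```r
# Extra words suppressed in the analysis (taken from the text above)
extra_stop_words <- c("thats", "dont", "back", "can", "get", "cant", "come")

# Hypothetical helper: lower-case, strip punctuation, split on spaces, count.
count_words <- function(text, drop_words = character(0)) {
  text <- tolower(paste(text, collapse = " "))
  text <- gsub("\\s+", " ", text, perl = TRUE)   # normalize whitespace
  text <- gsub("[^a-z ]", "", text, perl = TRUE) # "can't" becomes "cant"
  tokens <- strsplit(text, " +")[[1]]
  tokens <- tokens[nzchar(tokens) & !tokens %in% drop_words]
  sort(table(tokens), decreasing = TRUE)
}

# Made-up sample sentence, not taken from a debate transcript
sample_text <- paste(
  "People know the country is going places.",
  "People can't get back what they don't have,",
  "and that's why people come together."
)
counts <- count_words(sample_text, extra_stop_words)
head(counts, 3)
```

Stripping punctuation before splitting is what turns contractions into tokens such as “thats” and “dont”, which is why they appear in that form in the suppression list.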
The resulting word lists tell a somewhat clearer story: we can almost read these words like sentences or sentence fragments. Bernie Sanders and Donald Trump again seem to come across similarly as populists. The top-ranking word for Donald Trump is country, and for Bernie Sanders it’s believe. It’s interesting to note that the notion of getting the “country going” comes through in the top three candidates.
More humorously, note that the word “clinton” figures in both Carly Fiorina’s and Hillary Clinton’s lists. These texts almost read as an assertion by Hillary and a counterargument by Carly.
Ted Cruz’s frequent words again seem to focus on business, and Marco Rubio loves America.
While word frequencies reveal differences between the approaches and personalities of the candidates, they don’t by themselves elucidate differences on specific policies or attitudes. Let’s try something else.
## Row.names trump sanders clinton fiorina all
## 1810 people 33 85 53 10 181
## 2483 think 9 55 90 9 163
## 1050 going 44 44 45 10 143
## 561 country 34 70 25 1 130
## 1363 know 23 26 56 19 124
## 2704 well 9 31 56 8 104
| word | trump | sanders | clinton | fiorina | all |
|---|---|---|---|---|---|
| think | 9 | 55 | 90 | 9 | 163 |
| know | 23 | 26 | 56 | 19 | 124 |
| well | 9 | 31 | 56 | 8 | 104 |
| people | 33 | 85 | 53 | 10 | 181 |
| government | 0 | 7 | 6 | 40 | 53 |
| every | 4 | 15 | 9 | 26 | 54 |
| need | 5 | 33 | 36 | 18 | 92 |
| country | 34 | 70 | 25 | 1 | 130 |
| going | 44 | 44 | 45 | 10 | 143 |
There’s additional information in whether words used frequently by one candidate are used at all by another candidate. While the wordcloud analysis gives some insight, we can analyze the data graphically for more quantitative information. Here is a graph of the “top” words used by all candidates.
The above doesn’t reveal much more information than the wordcloud analysis does. However, we can also pick some “key words” and sample for their frequency. For a first stab, let’s try:
key_words = c("tax", "government", "climate", "class", "wall", "street","terror", "economy", "immigrant", "america", "veteran", "drug", "health", "gun", "education", "bankruptcy", "money", "women", "war", "rights", "abortion", "violence")
## Row.names trump sanders clinton fiorina all rank
## 1062 government 0.0000000000 0.001622624 0.0012992638 0.025316456 53 1
## 2671 wall 0.0065281899 0.006722299 0.0023819835 0.001265823 53 2
## 2447 tax 0.0071216617 0.002549838 0.0008661758 0.010759494 44 3
## 2377 street 0.0005934718 0.006490496 0.0025985275 0.001265823 43 4
## 1605 money 0.0065281899 0.004404265 0.0006496319 0.004430380 40 5
## 1128 health 0.0000000000 0.004172462 0.0030316154 0.001898734 35 6
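The per-candidate columns above look like relative frequencies, that is, the count of a key word divided by the total number of tokens for that candidate. A minimal base R sketch of that calculation follows. This reading is an assumption, and `key_word_share`, `candidate_a`, and `candidate_b` are hypothetical names with made-up tokens.

```r
# Hypothetical helper: share of each key word among a candidate's tokens.
# Assumption: share = (count of key word) / (total tokens for the candidate).
key_word_share <- function(tokens_by_candidate, key_words) {
  shares <- vapply(
    tokens_by_candidate,
    function(tokens) {
      # factor() with fixed levels keeps a zero count for unused key words
      counts <- table(factor(tokens, levels = key_words))
      as.numeric(counts) / length(tokens)
    },
    numeric(length(key_words))
  )
  # Force a matrix so the result has the same shape for any number of words
  matrix(
    shares,
    nrow = length(key_words),
    dimnames = list(key_words, names(tokens_by_candidate))
  )
}

# Made-up token vectors, not real debate data
demo_tokens <- list(
  candidate_a = c("tax", "wall", "street", "tax", "people"),
  candidate_b = c("health", "people", "people", "wall")
)
shares <- key_word_share(demo_tokens, c("tax", "wall", "health"))
shares
```

Dividing by each candidate's own token total makes the columns comparable even when candidates speak for very different amounts of time.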
Since word frequency does not convey specific positions on issues, let’s look at word associations to see if we can get closer to meaning from more information about the context of word usage. This analysis simply tokenizes the text as bigrams, then uses a simple function
bigram_table[grep(word, rownames(bigram_table), ignore.case=TRUE)]
to pull out relevant terms from the tokenized TDM. A key challenge is that the texts are relatively short, so statistics comparing the word frequencies are poor. Nevertheless, we can see that context around different words, even at the relatively unsophisticated level of simple bigrams, starts to hint at differences in approach to problems.
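To make the bigram lookup concrete, here is a minimal base R sketch. It builds a one-dimensional table of bigram counts and does not use a {tm} term-document matrix, so it is a stand-in for the approach described above. The names `bigram_counts` and `find_bigrams` and the sample sentence are hypothetical.

```r
# Hypothetical helper: count adjacent word pairs in a text.
# Simplification: bigrams are allowed to span sentence boundaries.
bigram_counts <- function(text) {
  text <- tolower(paste(text, collapse = " "))
  text <- gsub("\\s+", " ", text, perl = TRUE)
  text <- gsub("[^a-z ]", "", text, perl = TRUE)
  words <- strsplit(text, " +")[[1]]
  words <- words[nzchar(words)]
  if (length(words) < 2) return(table(character(0)))
  table(paste(head(words, -1), tail(words, -1)))
}

# Same lookup expression as in the text; on a one-dimensional table,
# rownames() returns the bigram labels.
find_bigrams <- function(word, bigram_table) {
  bigram_table[grep(word, rownames(bigram_table), ignore.case = TRUE)]
}

# Made-up sample sentence, not taken from a debate transcript
sample_text <- paste(
  "We must reform the tax code.",
  "The tax code is too long, and tax reform is overdue."
)
bigram_table <- bigram_counts(sample_text)
hits <- find_bigrams("tax", bigram_table)
hits
```

Because `grep` matches substrings, searching for “tax” would also pick up bigrams containing “taxes” or “taxation”, which is what makes this usable as a rough word-stem search.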
Bernie Sanders talks about “tax” and “terror” as well. His discussion of taxes has a reformist bent, but where Carly Fiorina associates words like budgeting, changes, reform, simplify, code, and plan, Bernie Sanders associates words like cap, income, must, share, speculation, breaks, reform, wall, and rebuilding.